Create tests/distributed/test_mnnvl_alltoall.py #35241

puririshi98 wants to merge 47 commits into vllm-project:main from
Conversation
All 5 tests pass on 8xH100 with the latest NVIDIA stack.

Signed-off-by: Rishi Puri <riship@nvidia.com>
Code Review
The pull request introduces a new test file for MNNVL AllToAll operations, ensuring the correct functionality and initialization of FlashInfer components within a distributed environment. The tests cover manager initialization, workspace reinitialization, and the ensure_initialized method, as well as a custom communicator wrapper. The setup correctly handles multi-GPU environments and checks for necessary system capabilities like SYS_PTRACE.
One area for improvement is the broad exception handling in has_sys_ptrace_capability, which could mask underlying issues.
```python
except Exception:
    pass
```

Catching a generic `Exception` can hide specific issues that might arise while reading or parsing `/proc/self/status`. It's generally better to catch more specific exceptions (e.g., `IOError`, `ValueError`) to avoid masking other potential bugs. While this function is only a capability check, more precise exception handling would improve maintainability and debugging.

Suggested change:

```diff
-    except Exception:
-        pass
+    except (IOError, ValueError) as e:
+        # Log the error for debugging purposes, but continue with alternative checks
+        print(f"Warning: Error reading /proc/self/status: {e}")
```
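To see the suggestion in context, here is a minimal sketch of such a capability check. This is an illustrative reconstruction, not the actual helper from the test file: the assumption is that `has_sys_ptrace_capability` parses the `CapEff` bitmask from `/proc/self/status` (CAP_SYS_PTRACE is bit 19 on Linux), and the real implementation may differ.

```python
# Hypothetical reconstruction of the capability check discussed above.
# Assumption: the check reads the effective-capability bitmask from
# /proc/self/status; CAP_SYS_PTRACE is bit 19 (see capabilities(7)).
CAP_SYS_PTRACE = 19


def has_sys_ptrace_capability() -> bool:
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("CapEff:"):
                    # CapEff is a hex bitmask, e.g. "CapEff: 000001ffffffffff"
                    cap_eff = int(line.split()[1], 16)
                    return bool(cap_eff & (1 << CAP_SYS_PTRACE))
    except (IOError, ValueError) as e:
        # Log for debugging, but treat failures as "capability unavailable"
        # instead of silently swallowing every exception.
        print(f"Warning: Error reading /proc/self/status: {e}")
    return False
```

With the narrower `except` clause, a malformed `CapEff` line or an unreadable procfs is reported rather than silently ignored, while unrelated bugs (e.g. a `NameError`) still surface normally.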
Hi @puririshi98, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: Rishi Puri <riship@nvidia.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
recent changes: on dgxh100
Signed-off-by: Rishi Puri <riship@nvidia.com>
Yes, this test refers to the MNNVL all2allv implementation from PR #21003. The test file validates FlashInferAllToAllManager which:
Signed-off-by: Rishi Puri <riship@nvidia.com>
Replace deprecated torch.cuda API calls with torch.accelerator equivalents:

- torch.cuda.set_device() → torch.accelerator.set_device_index()
- torch.cuda.current_device() → torch.accelerator.current_device_index()
- torch.cuda.device_count() → torch.accelerator.device_count()

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Claude <claude@anthropic.com>
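The three renames in the commit message can be captured in a small lookup table, which is handy when sweeping a codebase for the deprecated spellings. This is only a sketch: `migrate_call` is a hypothetical helper written for illustration, not part of the PR, and the mapping itself is taken verbatim from the commit message.

```python
# Mapping of deprecated torch.cuda calls to their torch.accelerator
# equivalents, as listed in the commit message above.
CUDA_TO_ACCELERATOR = {
    "torch.cuda.set_device": "torch.accelerator.set_device_index",
    "torch.cuda.current_device": "torch.accelerator.current_device_index",
    "torch.cuda.device_count": "torch.accelerator.device_count",
}


def migrate_call(call: str) -> str:
    """Return the torch.accelerator spelling for a deprecated torch.cuda
    call, or the call unchanged if it has no replacement in the table."""
    return CUDA_TO_ACCELERATOR.get(call, call)
```

Note that only `set_device` and `current_device` change their suffix (gaining `_index`); `device_count` keeps its name under the new namespace.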
How hard would it be to add … Might be better to make the names consistent in this test for clarity?
WIP |
Signed-off-by: Rishi Puri <riship@nvidia.com>
hjjq left a comment:
Hi @puririshi98 I've left some questions. PTAL
Signed-off-by: Rishi Puri <riship@nvidia.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
All 5 tests pass on 8xH100 with the latest NVIDIA stack. This is part of the NVIDIA effort to add CI to upstream GitHub.
GPU Hours Estimate: